2025-04-03
See Chapters 29-31 of our Course Notes
The Stanford University Heart Transplant Study1 examined whether an experimental heart transplant program increased lifespan. The heart_tr.sav data (saved as an SPSS file) includes 103 observations on 7 variables, as we’ll see.
survtime, which is the number of days the subject was alive after the date they were determined to be a candidate for heart transplant until the termination of the study.Our first step is to ingest the data, and build a tibble.
| Variable | Description |
|---|---|
id |
Random last name, assigned to the subject by Dr. Love |
age |
subject’s age in years at the start of the study |
survived |
survival status (alive or dead) at end of follow-up |
survtime |
days subject was alive from the start of the study to its end |
prior |
did the subject have a prior surgery (yes or no) |
transplant |
treatment (received a transplant) or control (did not) |
wait |
waiting time for transplant (in days) |
We’ll use the read_spss() function1, from the haven package.
heart_tr data# A tibble: 103 × 7
id age survived survtime prior transplant wait
<chr> <dbl> <dbl+lbl> <dbl> <dbl+lbl> <dbl+lbl> <dbl>
1 Akter 51 2 [dead] 6 2 [no] 2 [control] NA
2 Alcorn 30 2 [dead] 50 2 [no] 2 [control] NA
3 Ali 33 1 [alive] 1799 2 [no] 1 [treatment] 25
4 Alway 40 2 [dead] 39 2 [no] 1 [treatment] 36
5 Amadeo 20 2 [dead] 18 2 [no] 2 [control] NA
6 Aybar 54 2 [dead] 3 2 [no] 2 [control] NA
7 Bargiel 50 2 [dead] 675 2 [no] 1 [treatment] 51
8 Bayse 45 2 [dead] 40 2 [no] 2 [control] NA
9 Beason 47 2 [dead] 85 2 [no] 2 [control] NA
10 Boddicker 42 2 [dead] 58 2 [no] 1 [treatment] 12
# ℹ 93 more rows
<dbl+lbl> types for the columns containing categorical information.survived?# A tibble: 2 × 2
survived n
<dbl+lbl> <int>
1 1 [alive] 28
2 2 [dead] 75
Labels:
value label
1 alive
2 dead
survived.survived into a factor, without labelsheart_tr <- heart_tr |>
mutate(survived = fct_recode(factor(survived),
"alive" = "1", "dead" = "2"))
heart_tr |> count(survived) # now a factor, without labels# A tibble: 2 × 2
survived n
<fct> <int>
1 alive 28
2 dead 75
Now, we have a factor representation of the survived information, with values that make sense.
prior into a factor# A tibble: 2 × 2
prior n
<dbl+lbl> <int>
1 1 [yes] 12
2 2 [no] 91
prior, too.tranplant into a 0/1 indicatorInstead of making transplant into a factor, we’ll create a numeric description of the transplant information, where transplant = 1 will mean that the subject did have a heart transplant, and transplant = 0 will mean that the subject did not have a heart transplant.
transplant to 0/1heart_tr after our changesRows: 103
Columns: 7
$ id <chr> "Akter", "Alcorn", "Ali", "Alway", "Amadeo", "Aybar", "Barg…
$ age <dbl> 51, 30, 33, 40, 20, 54, 50, 45, 47, 42, 47, 53, 54, 53, 53,…
$ survived <fct> dead, dead, alive, dead, dead, dead, dead, dead, dead, dead…
$ survtime <dbl> 6, 50, 1799, 39, 18, 3, 675, 40, 85, 58, 153, 8, 81, 1386, …
$ prior <fct> no, no, no, no, no, no, no, no, no, no, no, no, no, no, no,…
$ transplant <dbl> 0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1, 1, 0, 1, 1, 0, 0, 1,…
$ wait <dbl> NA, NA, 25, 36, NA, NA, 51, NA, NA, 12, 26, NA, 17, 37, NA,…
survived and prior) or as a 1/0 numeric variable (transplant.)id was a set of (fake) last names, rather than numbers, this is already a character variable, which is what we want. Are they unique?survtime shows the in-study time (in days) until death or censoringsurvived is a factor showing “dead” or “alive”.[1] 6 50 1799+ 39 18 3
transplant groupsCall: survfit(formula = S ~ transplant, data = heart_tr)
transplant=0
time n.risk n.event survival std.err lower 95% CI upper 95% CI
1 34 1 0.9706 0.0290 0.9154 1.000
2 33 3 0.8824 0.0553 0.7804 0.998
3 30 3 0.7941 0.0693 0.6692 0.942
5 27 1 0.7647 0.0727 0.6346 0.921
6 26 2 0.7059 0.0781 0.5682 0.877
8 24 1 0.6765 0.0802 0.5362 0.853
9 23 1 0.6471 0.0820 0.5048 0.829
12 21 1 0.6162 0.0836 0.4723 0.804
16 20 1 0.5854 0.0849 0.4405 0.778
18 19 1 0.5546 0.0859 0.4094 0.751
21 18 2 0.4930 0.0867 0.3493 0.696
32 15 1 0.4601 0.0869 0.3177 0.666
35 14 1 0.4273 0.0867 0.2871 0.636
36 13 1 0.3944 0.0860 0.2572 0.605
37 12 1 0.3615 0.0849 0.2281 0.573
40 11 2 0.2958 0.0812 0.1727 0.507
50 9 1 0.2629 0.0786 0.1464 0.472
69 8 1 0.2301 0.0753 0.1211 0.437
85 7 1 0.1972 0.0714 0.0970 0.401
102 6 1 0.1643 0.0666 0.0743 0.364
149 5 1 0.1315 0.0609 0.0531 0.326
263 4 1 0.0986 0.0538 0.0338 0.287
340 3 1 0.0657 0.0448 0.0173 0.250
transplant=1
time n.risk n.event survival std.err lower 95% CI upper 95% CI
5 69 1 0.986 0.0144 0.9577 1.000
16 68 2 0.957 0.0246 0.9096 1.000
17 66 1 0.942 0.0281 0.8885 0.999
28 65 1 0.928 0.0312 0.8683 0.991
30 64 1 0.913 0.0339 0.8489 0.982
39 63 1 0.899 0.0363 0.8301 0.973
43 61 1 0.884 0.0386 0.8113 0.963
45 60 1 0.869 0.0407 0.7929 0.953
51 59 1 0.854 0.0426 0.7748 0.942
53 58 1 0.840 0.0443 0.7571 0.931
58 57 1 0.825 0.0459 0.7396 0.920
61 56 1 0.810 0.0474 0.7224 0.909
66 55 1 0.795 0.0488 0.7053 0.897
68 54 2 0.766 0.0512 0.6719 0.873
72 52 2 0.737 0.0533 0.6391 0.849
77 50 1 0.722 0.0543 0.6229 0.836
78 49 1 0.707 0.0551 0.6069 0.824
80 48 1 0.692 0.0559 0.5910 0.811
81 47 1 0.678 0.0566 0.5752 0.798
90 46 1 0.663 0.0573 0.5596 0.785
96 45 1 0.648 0.0579 0.5441 0.772
100 44 1 0.633 0.0584 0.5287 0.759
110 42 1 0.618 0.0589 0.5130 0.745
153 40 1 0.603 0.0594 0.4969 0.731
165 39 1 0.587 0.0599 0.4810 0.717
186 37 1 0.572 0.0603 0.4647 0.703
188 36 1 0.556 0.0607 0.4485 0.688
207 35 1 0.540 0.0610 0.4325 0.674
219 34 1 0.524 0.0613 0.4166 0.659
285 32 2 0.491 0.0616 0.3840 0.628
308 30 1 0.475 0.0617 0.3680 0.613
334 29 1 0.458 0.0617 0.3521 0.597
342 27 1 0.441 0.0617 0.3356 0.581
583 20 1 0.419 0.0625 0.3132 0.562
675 16 1 0.393 0.0638 0.2860 0.540
733 15 1 0.367 0.0647 0.2597 0.519
852 13 1 0.339 0.0656 0.2317 0.495
979 10 1 0.305 0.0672 0.1979 0.470
995 9 1 0.271 0.0678 0.1660 0.442
1032 8 1 0.237 0.0672 0.1360 0.413
1386 5 1 0.190 0.0685 0.0935 0.385
Call:
survdiff(formula = S ~ transplant, data = heart_tr)
N Observed Expected (O-E)^2/E (O-E)^2/V
transplant=0 34 30 12.1 26.5 33.2
transplant=1 69 45 62.9 5.1 33.2
Chisq= 33.2 on 1 degrees of freedom, p= 8e-09
km_trggsurvplot(), andkm_trIf S(t) is the survival function, and time t is taken to be continuous, then the hazard function H(t) is defined as:
\[ S(t) = e^{H(t)} \mbox{ and, thus, } H(t) = -ln(S(t)) \]
Consider a subject in the heart transplant study who has a survival time of 1000 days. Let’s ignore the transplant group information for a moment.
Suppose we want to estimate H(t) across all subjects.
I’ll build something called H.est1, the inverse K-M estimate…
For our km_tr fit, we’d use
The Cox proportional hazards model fits survival data with a constant (not varying over time) covariate (here, transplant group) to a hazard function of the form:
\[ h(t|transplant) = h_0(t)exp(\beta_1 transplant) \]
where we estimate the unknown value of \(\beta_1\) and where \(h_0(t)\) is the baseline hazard which depends on time \(t\) but not on the transplant group.
fit1_trCall:
coxph(formula = S ~ transplant, data = heart_tr)
coef exp(coef) se(coef) z p
transplant -1.3234 0.2662 0.2438 -5.428 5.69e-08
Likelihood ratio test=25.95 on 1 df, p=3.511e-07
n= 103, number of events= 75
Our hazard ratio estimate is 0.2662 for transplant group 1 (vs. transplant group 0)
fit1_tr)Parameter | Coefficient | SE | 90% CI | z | p
---------------------------------------------------------------------
transplant | -1.323 | 0.244 | [-1.724, -0.922] | -5.428 | < .001
Parameter | Coefficient | SE | 90% CI | z | p
-------------------------------------------------------------------
transplant | 0.266 | 0.065 | [0.178, 0.398] | -5.428 | < .001
tidy() results from broom?fit1_tr PerformanceResponse residuals not available to calculate mean square error. (R)MSE
is probably not reliable.
Warning: Can't calculate weighted residuals from model.
# Indices of model performance
AIC | AICc | BIC | Nagelkerke's R2 | RMSE | Sigma
-------------------------------------------------------------
572.297 | 572.336 | 574.931 | 0.223 | 0.947 | 0.000
glance() from broom?| n | nevent | statistic.log | p.value.log | statistic.sc | p.value.sc | statistic.wald | p.value.wald | statistic.robust | p.value.robust | r.squared | r.squared.max | concordance | std.error.concordance | logLik | AIC | BIC | nobs |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 103.00 | 75.00 | 25.95 | 0.00 | 33.38 | 0.00 | 29.47 | 0.00 | NA | NA | 0.22 | 1.00 | 0.67 | 0.02 | −285.15 | 572.30 | 574.61 | 75.00 |
fit1_tr modelIf the proportional hazards assumption is appropriate, we should see a slope of essentially zero in the residuals that are plotted against time on the next slide.
fit1_tr)age?Call:
coxph(formula = S ~ transplant + age, data = heart_tr)
coef exp(coef) se(coef) z p
transplant -1.78994 0.16697 0.27107 -6.603 4.02e-11
age 0.06020 1.06205 0.01527 3.943 8.03e-05
Likelihood ratio test=44.55 on 2 df, p=2.121e-10
n= 103, number of events= 75
fit2_tr model parametersParameter | Coefficient | SE | 90% CI | z | p
-------------------------------------------------------------------
transplant | 0.167 | 0.045 | [0.107, 0.261] | -6.603 | < .001
age | 1.062 | 0.016 | [1.036, 1.089] | 3.943 | < .001
fit2_tr R-square measures# Indices of model performance
AIC | AICc | BIC | Nagelkerke's R2 | RMSE | Sigma
-------------------------------------------------------------
555.695 | 555.815 | 560.965 | 0.352 | 0.880 | 0.000
model_parameters() gives the Nagelkerke \(R^2\).glance(fit2_tr) |> select(n, nevent, nobs, r.squared, r.squared.max) |>
gt() |> fmt_number(columns = r.squared:r.squared.max, decimals = 3) |>
tab_options(table.font.size = 20) |> opt_stylize(style = 5, color = "blue")| n | nevent | nobs | r.squared | r.squared.max |
|---|---|---|---|---|
| 103 | 75 | 75 | 0.351 | 0.997 |
glance() gives the Cox-Snell \(R^2\) along with its maximum value (< 1.)fit2_tr concordance measureglance(fit2_tr) |> select(n, nevent, nobs, concordance,
se_conc = std.error.concordance) |>
gt() |> fmt_number(columns = concordance:se_conc, decimals = 3) |>
tab_options(table.font.size = 20) |> opt_stylize(style = 5, color = "blue")| n | nevent | nobs | concordance | se_conc |
|---|---|---|---|---|
| 103 | 75 | 75 | 0.721 | 0.034 |
Compare the model’s prediction for a pair of observations in the data. A pair is concordant if the prediction and data agree in direction. Concordance is the fraction of pairs that are concordant.
fit2_tr modelage?fit1_tr includes transplant group, fit2_tr adds age.fit1_tr to fit2_trfit2_tr)fit3_tr adding prior surgeryCall:
coxph(formula = S ~ transplant + age + prior, data = heart_tr)
coef exp(coef) se(coef) z p
transplant -1.66121 0.18991 0.27588 -6.021 1.73e-09
age 0.05919 1.06098 0.01494 3.962 7.44e-05
priorno 0.74266 2.10152 0.44225 1.679 0.0931
Likelihood ratio test=47.89 on 3 df, p=2.247e-10
n= 103, number of events= 75
fit3_tr model parametersParameter | Coefficient | SE | 90% CI | z | p
-------------------------------------------------------------------
transplant | 0.190 | 0.052 | [0.121, 0.299] | -6.021 | < .001
age | 1.061 | 0.016 | [1.035, 1.087] | 3.962 | < .001
priorno | 2.102 | 0.929 | [1.015, 4.350] | 1.679 | 0.093
fit3_tr R-square measures# Indices of model performance
AIC | AICc | BIC | Nagelkerke's R2 | RMSE | Sigma
-------------------------------------------------------------
554.352 | 554.595 | 562.256 | 0.373 | 0.893 | 0.000
glance(fit3_tr) |> select(n, nevent, nobs, r.squared, r.squared.max) |>
gt() |> fmt_number(columns = r.squared:r.squared.max, decimals = 3) |>
tab_options(table.font.size = 20) |> opt_stylize(style = 5, color = "blue")| n | nevent | nobs | r.squared | r.squared.max |
|---|---|---|---|---|
| 103 | 75 | 75 | 0.372 | 0.997 |
fit3_tr concordanceglance(fit3_tr) |> select(n, nevent, nobs, concordance,
se_conc = std.error.concordance) |>
gt() |> fmt_number(columns = concordance:se_conc, decimals = 3) |>
tab_options(table.font.size = 20) |> opt_stylize(style = 5, color = "blue")| n | nevent | nobs | concordance | se_conc |
|---|---|---|---|---|
| 103 | 75 | 75 | 0.739 | 0.031 |
fit3_trfit3_trWe could add a non-linear predictor term or use a different kind of survival model.
If the PH assumption fails on a categorical predictor, fit a Cox model stratified by that predictor (use strata(var) rather than var in the specification of the coxph model.)
If the PH assumption is violated, this means the hazard isn’t constant over time, so we could fit separate Cox models for a series of time intervals.
Another option would be to use an extension of the Cox model that permits covariates to vary over time.
Visit https://cran.r-project.org/web/packages/survival/vignettes/timedep.pdf for details on building the relevant data sets and models, with examples.
432 Class 22 | 2025-04-03 | https://thomaselove.github.io/432-2025/